Distributed Duplicate Detection in Post-Process Data De-duplication
Authors
Abstract
Data de-duplication is essentially a data compression technique for eliminating coarse-grained redundant data. A typical flavor of de-duplication detects duplicate data blocks within the storage device and de-duplicates them by keeping a single copy and placing pointers to it, rather than storing multiple copies at various places on the disk. Since the advent of de-duplication, the conventional approach has been to scale up de-duplication at the storage controller by consuming more of the controller's resources. This approach has led to several bottlenecks, the most evident being the hogging of controller resources, which limits the number of concurrent de-duplication threads the controller can run and ultimately results in poor de-duplication performance. Given the rate at which data volumes are growing, and with data becoming the core asset that separates one organization from another, high-performing, scalable de-duplication is a challenge organizations are already starting to face. In this work, we propose a scalable design for a distributed de-duplication system that leverages clusters of commodity nodes to scale out suitable tasks of a typical de-duplication system. We explain our distributed duplicate detection workflow, implemented in Hadoop's MapReduce programming abstraction, and discuss the performance statistics obtained with the scale-out de-duplication model.
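The duplicate detection step lends itself naturally to MapReduce. The following is a minimal sketch, not the implementation described in the paper: it assumes an upstream chunking stage has already split the stored data into blocks and computed a fingerprint for each block (for example, a SHA-1 digest), and that each input record is a tab-separated pair of block location and fingerprint. The class names and record layout are illustrative assumptions. The mapper keys each record on its fingerprint so that the shuffle groups identical blocks at one reducer, and the reducer reports every fingerprint observed at more than one location as a group of candidate duplicates.

// Sketch of distributed duplicate detection as a Hadoop MapReduce job.
// Assumes input lines of the form "<blockLocation>\t<fingerprint>" produced
// by an upstream chunking/hashing stage; names and layout are illustrative.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DuplicateDetection {

    // Emits (fingerprint, blockLocation) so the shuffle brings identical
    // fingerprints, i.e. identical block contents, to the same reducer.
    public static class FingerprintMapper extends Mapper<Object, Text, Text, Text> {
        private final Text fingerprint = new Text();
        private final Text location = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            if (fields.length != 2) {
                return; // skip malformed records
            }
            location.set(fields[0]);
            fingerprint.set(fields[1]);
            context.write(fingerprint, location);
        }
    }

    // Collects all locations sharing a fingerprint; any fingerprint seen at
    // two or more locations is reported as a group of candidate duplicates.
    public static class DuplicateReducer extends Reducer<Text, Text, Text, Text> {
        private final Text duplicates = new Text();

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            StringBuilder locations = new StringBuilder();
            int count = 0;
            for (Text value : values) {
                if (count > 0) {
                    locations.append(",");
                }
                locations.append(value.toString());
                count++;
            }
            if (count > 1) {
                duplicates.set(locations.toString());
                context.write(key, duplicates);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "duplicate detection");
        job.setJarByClass(DuplicateDetection.class);
        job.setMapperClass(FingerprintMapper.class);
        job.setReducerClass(DuplicateReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The output, a list of fingerprints with their duplicate block locations, would then feed a separate de-duplication pass on the storage controller that replaces the redundant copies with pointers; that pass is outside the scope of this sketch.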
Similar Resources
Cloud Based Data Deduplication with Secure Reliability
IJRAET Abstract: Data de-duplication is used to eliminate duplicate copies of data. It is also used in cloud storage to reduce storage space and upload bandwidth: only one copy of each file is stored in the cloud and can be shared by many users. The de-duplication process thus helps to save storage space, but it also raises the challenge of privacy for sensitive data. The aim of this pap...
Duplicate Web Pages Detection with the Support of 2D Table Approach
Duplicate and near-duplicate web pages hamper the operation of search engines. As a consequence of duplicates and near-duplicates, a common issue for search engines is the growth of indexed storage pages. This high storage demand slows down processing, which in turn increases the serving cost. Finally, duplication also arises while gathering the required data from the var...
Cluster Based Duplicate Detection
We propose a clustering technique for entropy-based text dissimilarity calculation in a de-duplication system. To improve the quality of grouping, in this study we propose a Multi-Level Group Detection (MLGD) algorithm, which produces highly accurate groups of closely related objects using the Alternative Decision Tree (ADT) technique. We propose two new algorithms; the first is a Multi-Level Group...
RefConcile - Automated Online Reconciliation of Bibliographic References
Comprehensive bibliographies often rely on community contributions. In such settings, de-duplication is mandatory for the bibliography to be useful. Ideally, de-duplication works online, i.e., when adding new references, so the bibliography remains duplicate-free at all times. While de-duplication is well researched, generic approaches do not achieve the result quality required for automated re...
A New Method for Duplicate Detection Using Hierarchical Clustering of Records
Accuracy and validity of data are prerequisites for the proper operation of any software system. There is always a possibility of errors occurring in data due to human and system faults. One such error is the existence of duplicate records in data sources. Duplicate records refer to the same real-world entity, of which only one should exist in a data source, but for reasons such as aggregation of ...